Face detection and tracking in a video sequence

ABSTRACT

A method ( 100 ) and apparatus ( 700 ) are disclosed for detecting and tracking human faces across a sequence of video frames. Spatiotemporal segmentation is used to segment ( 115 ) the sequence of video frames into 3D segments. 2D segments are then formed from the 3D segments, with each 2D segment being associated with one 3D segment. Features are extracted ( 140 ) from the 2D segments and grouped into groups of features. For each group of features, a probability that the group of features includes human facial features is calculated ( 145 ) based on the similarity of the geometry of the group of features with the geometry of a human face model. Each group of features is also matched with a group of features in a previous 2D segment and an accumulated probability that said group of features includes human facial features is calculated ( 150 ). Each 2D segment is classified ( 155 ) as a face segment or a non-face segment based on the accumulated probability. Human faces are then tracked by finding 2D segments in subsequent frames associated with 3D segments associated with face segments.

TECHNICAL FIELD OF THE INVENTION

The present invention relates generally to video processing and, in particular, to the detection and tracking of human faces in a video sequence.

BACKGROUND

Interpretation of a video sequence for human-machine interface purposes is a difficulty often encountered in the image processing industry. Face tracking in particular is one of the most important aspects of such interpretation of video sequences and may be classified as a high level problem, and is often an important initial step in many other applications, including face recognition. Another application is content summarisation, in which an object-based description of the video content is compiled for indexing, browsing, and searching functionalities. Yet another application is active camera control, in which the parameters of a camera may be altered to optimise the filming of detected faces.

Typically, face tracking is divided into two separate steps. First, frames of the video sequence are analysed to detect the location of one or more faces. When a face is detected, that face is then tracked until its disappearance.

A cue often used in the detection of faces is skin colour. Some known face detection and tracking methods proceed by labelling a detected object having the colour of skin as being a face, and track such objects through time. More sophisticated techniques further analyse each detected object having the colour of skin to determine whether the object includes facial features, like eyes and mouth, in order to verify that the object is in fact a face. However, whilst this technique is fast, it is unreliable. The reason for the unreliability is that skin colour changes under different lighting conditions, causing the skin detection to become unstable.

Other techniques use motion and shape as the main cues. Whenever an elliptical contour is detected within a frame, the object is labelled as a face. Hence, these techniques use a very simple model of the face, that being an ellipse, and assume that the face is moving through the video sequence. Static faces would therefore not be detected.

SUMMARY OF THE INVENTION

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

According to a first aspect of the invention, there is provided a method of detecting and tracking human faces across a sequence of video frames, said method comprising the steps of:

(a) forming a 3D pixel data block from said sequence of video frames;

(b) segmenting said 3D data block into a set of 3D segments using 3D spatiotemporal segmentation;

(c) forming 2D segments from an intersection of said 3D segments with a view plane, each 2D segment being associated with one 3D segment;

(d) in at least one of said 2D segments, extracting features and grouping said features into one or more groups of features;

(e) for each group of features, computing a probability that said group of features represents human facial features based on the similarity of the geometry of said group of features with the geometry of a human face model;

(f) matching at least one group of features with a group of features in a previous 2D segment and computing an accumulated probability that said group of features represents human facial features using probabilities of matched groups of features;

(g) classifying each 2D segment as a face segment or a non-face segment based on said accumulated probability of at least one group of features in each of said 2D segments; and

(h) tracking said human faces by finding an intersection of 3D segments associated with said face segments with at least subsequent view planes.

According to another aspect of the invention, there is provided an apparatus for implementing the aforementioned method.

According to another aspect of the invention, there is provided a computer program for implementing the method described above.

Other aspects of the invention are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

A number of embodiments of the present invention will now be described with reference to the drawings, in which:

FIG. 1 is a schematic block diagram representation of a programmable device in which arrangements described may be implemented;

FIG. 2 shows a flow diagram of the main processing steps of a method of detecting and tracking human faces across a sequence of video frames;

FIG. 3 shows a flow diagram of the sub-steps of the facial feature extraction step;

FIG. 4 shows a flow diagram of the sub-steps for calculating a probability for triangles formed in segments;

FIG. 5 shows a flow diagram of the sub-steps for calculating an accumulated probability;

FIG. 6 illustrates a sequence of video frames, with a window including the most recently received frames, forming a “block” of pixel data;

FIG. 7 shows a flow diagram of the sub-steps of a 3D-segmentation step;

FIG. 8 shows an example of the triangle formed between the centroids of three possible facial features and the angle α that the uppermost line of the triangle makes with the horizontal; and

FIG. 9 shows a flow diagram of the sub-steps of a segment pre-filtering step.

DETAILED DESCRIPTION INCLUDING BEST MODE

Where reference is made in any one or more of the accompanying drawings to steps and/or features which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities.

Apparatus

FIG. 1 shows a programmable device 700 for performing the operations of a human face detection and tracking method described below. Such a programmable device 700 may be specially constructed for the required purposes, or may comprise a general-purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer or device. The algorithms presented herein are not inherently related to any particular computer or other apparatus.

The programmable device 700 comprises a computer module 701, input devices such as a camera 750, a keyboard 702 and mouse 703, and output devices including a display device 714. A Modulator-Demodulator (Modem) transceiver device 716 is used by the computer module 701 for communicating to and from a communications network 720, for example connectable via a telephone line 721 or other functional medium. The modem 716 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN).

The computer module 701 typically includes at least one processor unit 705, a memory unit 706, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output (I/O) interfaces including a video interface 707, an I/O interface 713 for the keyboard 702 and mouse 703, and an interface 708 for the modem 716 and the camera 750 through connection 748. A storage device 709 is provided and typically includes a hard disk drive 710 and a floppy disk drive 711. A CD-ROM drive 712 is typically provided as a non-volatile source of data. The components 705 to 713 of the computer module 701 typically communicate via an interconnected bus 704 and in a manner which results in a conventional mode of operation of the programmable device 700 known to those in the relevant art.

The programmable device 700 may be constructed from one or more integrated circuits performing the functions or sub-functions of the human face detection and tracking method, and for example incorporated in the digital video camera 750. As seen, the camera 750 includes a display screen 752, which can be used to display a video sequence and information regarding the same.

The method may be implemented as software, such as an application program executing within the programmable device 700. The application program may be stored on a computer readable medium, including the storage devices 709. The application program is read into the computer from the computer readable medium, and then executed by the processor 705. A computer readable medium having such software or computer program recorded on it is a computer program product. Intermediate storage of the program and any data fetched from the network 720 and camera 750 may be accomplished using the semiconductor memory 706, possibly in concert with the hard disk drive 710. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 712 or 711, or alternatively may be read by the user from the network 720 via the modem device 716. The foregoing is merely exemplary of relevant computer readable media. Other computer readable media may be used without departing from the scope and spirit of the invention.

The use of the computer program product in the programmable device 700 preferably effects an advantageous apparatus for detecting and tracking human faces across a sequence of video frames.

Human Face Detection and Tracking Method

FIG. 2 shows a schematic flow diagram of the main processing steps of a method 100 of detecting and tracking human faces across a sequence of video frames. The steps of method 100 are effected by instructions in the application program that are executed by the processor 705 of the programmable device 700 (FIG. 1). The method 100 receives in step 105, at each frame interval n, a two dimensional array of pixel data. The pixel data for each pixel includes colour values φ′(x,y,n), typically from an image sensor such as that in camera 750 (FIG. 1). The colour values φ′(x,y,n) are typically in some colour space, such as RGB or LUV, and may be received directly from camera 750, or may be from a recorded video sequence stored on the storage device 709 (FIG. 1).

The RGB colour space is not particularly suited for segmentation and human skin detection, and the colour values φ′(x,y,n) are converted into a predetermined colour space in step 110 to form colour values φ(x,y,n). The predetermined colour space may be CIE Luv, or a combination of colour spaces, which is more suited to segmentation and skin detection.

Step 115 follows wherein the processor 705 performs spatiotemporal (3D) segmentation of the colour values φ(x,y,n), with time being the third dimension. The segmentation is based on colour, forming contiguous 3D segments S_(i) of pixels having similar colour.

Any 3D-segmentation algorithm may be used in step 115. In the preferred implementation, the Mumford-Shah 3D-segmentation algorithm is used, the detail of which is described in a later section.

The preferred 3D segmentation uses the colour values φ(x,y,n) of the L+1 most recently received frames, with L being a fixed, positive non-zero latency of the 3D-segmentation, to produce as output a set of three-dimensional segments {S_(i)} having homogeneous colour. FIG. 6 illustrates a sequence of video frames, with those frames illustrated in phantom being future frames not yet received by the programmable device 700. A window 600 includes the L+1 most recently received frames, forming a “block” of pixel data. The block of pixel data also includes a view plane 610, which is coplanar with the oldest frame in the current window 600, received at frame interval n−L. As new colour values φ(x,y,n+1) are received in step 105, the new frame is added to the window 600, while the oldest frame in the window 600 is removed, thereby maintaining L+1 frames in the window 600.
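By way of illustration only, the sliding window may be kept in a simple frame buffer. The following Python sketch is not part of the disclosed method; the class name, the (frames, height, width, channels) array layout and the use of NumPy are assumptions made for the example.

```python
from collections import deque
import numpy as np

class FrameWindow:
    """Sliding block of the L+1 most recently received frames (illustrative)."""

    def __init__(self, latency_L: int):
        # A bounded deque drops the oldest frame automatically when full.
        self.frames = deque(maxlen=latency_L + 1)

    def push(self, frame: np.ndarray):
        """Add the frame received at interval n; return the pixel block once full.

        The returned block has shape (L+1, height, width, channels); its
        oldest slice (index 0) lies on the view plane at interval t = n - L.
        """
        self.frames.append(frame)
        if len(self.frames) < self.frames.maxlen:
            return None  # still filling the initial window
        return np.stack(self.frames, axis=0)
```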

Referring also to FIG. 2, with the set of 3D segments {S_(i)} formed, in step 120 the processor 705 “slices through” the 3D segments S_(i) at frame interval t<n, which typically is the frame interval of the view plane 610 (FIG. 6), to produce 2D segments s_(t) ^(i). Each 2D segment s_(t) ^(i) incorporates the pixels at frame interval t that are included in the corresponding 3D segment S_(i). Hence the 2D segments s_(t) ^(i) include pixel locations (x,y) satisfying:

$\begin{matrix}{s_{t}^{i} = \left\{ {{({x,y})}:{{({x,y,t})} \in S_{i}}} \right\}} & (1)\end{matrix}$
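As a concrete illustration of equation (1), if the 3D segmentation is stored as an integer label volume, the slice at frame interval t is a single indexing operation. A minimal sketch, assuming a NumPy label array indexed (frame, row, column):

```python
import numpy as np

def slice_segments(labels_3d: np.ndarray, t: int) -> np.ndarray:
    """Return the 2D segment map at frame interval t (equation (1)).

    labels_3d: integer array of shape (frames, height, width) where
    labels_3d[n, y, x] = i means pixel (x, y, n) belongs to 3D segment S_i.
    """
    return labels_3d[t]

def pixels_of_segment(labels_2d: np.ndarray, i: int) -> np.ndarray:
    """Pixel locations (x, y) of the 2D segment s_t^i, as an (N, 2) array."""
    ys, xs = np.nonzero(labels_2d == i)
    return np.column_stack([xs, ys])
```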

The latency L causes a “delay” between the frame index n of the received colour values φ′(x,y,n) and the frame index t of the 2D segments s_(t) ^(i) returned by step 120 such that:

$\begin{matrix}{t = {n - L}} & (2)\end{matrix}$

All subsequent processing is carried out on the 2D segments s_(t) ^(i) at frame interval t.

The method 100 continues to step 125 where the processor 705 selects the next unlabeled segment s_(t) ^(i) for evaluation. As the segment s_(t) ^(i) is merely the intersection of 3D segment S_(i) with the view plane, once a segment s_(t) ^(i) is labelled as being a human face or not, that label is extended to the corresponding 3D segment S_(i). Similarly, once the 3D segment S_(i) has been labelled, all subsequent 2D segments s_(t) ^(i) associated with that 3D segment S_(i) receive the same label, and no further evaluation of such segments s_(t) ^(i) is required.

Step 127 determines whether the area of segment s_(t) ^(i) is equal to or above a minimum area m_(A). In the preferred implementation the minimum area m_(A) is 1500 pixels. If the area of segment s_(t) ^(i) is smaller than the minimum area m_(A), then the current segment s_(t) ^(i) is not evaluated any further and method 100 proceeds to step 160, where it is determined whether there are any more unlabeled segments s_(t) ^(i) that have not been evaluated. If there are unlabeled segments s_(t) ^(i) yet to be evaluated, then method 100 returns to step 125 where a next unlabeled segment s_(t) ^(i) is selected for evaluation.

If the area of segment s_(t) ^(i) is equal to or above the minimum area m_(A), then step 130 determines whether the segment s_(t) ^(i) satisfies a number of pre-filtering criteria. Segments s_(t) ^(i) not satisfying each of the pre-filtering criteria are likely not to correspond to a human face and may therefore be omitted from further processing. Such pre-filtering is optional and may include criteria such as whether the segment s_(t) ^(i) selected in step 125 has an elliptical shape, whether the segment s_(t) ^(i) has the colour of skin, and whether or not the segment s_(t) ^(i) moves. Step 130 is described in detail in a later section.

If any one of the pre-filtering criteria was not met, then the current segment s_(t) ^(i) is not evaluated any further and method 100 proceeds to step 160, where it is determined whether there are any more unlabeled segments s_(t) ^(i) that have not been evaluated.

If the current segment s_(t) ^(i) met all the pre-filter criteria, then method 100 continues to step 140 where the processor 705 extracts facial features, such as the eyes and the mouth, from the segment s_(t) ^(i) under consideration. FIG. 3 shows a flow diagram of the sub-steps of the facial feature extraction step 140. Step 140 starts in sub-step 305 by creating a boundary box sub-image b(x,y) of the frame data at frame interval t, with the boundary box sub-image b(x,y) being a rectangular shaped image including the colour values φ(x,y,t) within a bounding box formed around the segment s_(t) ^(i) being evaluated. Colour is no longer needed and sub-step 310 converts the boundary box sub-image b(x,y) into a greyscale image b′(x,y). In order to reduce the computational effort on the processor 705 in the sub-steps of the facial feature extraction step 140 that follow, the greyscale image b′(x,y) is re-scaled in sub-step 312 such that the area of the segment in the re-scaled greyscale image b″(x,y) (hereinafter simply greyscale image b″(x,y)) is equal to the minimum area m_(A). The greyscale image b″(x,y) is also stored in memory 706.
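The sub-image preparation of sub-steps 305 to 312 might be sketched as follows. The greyscale weights and the use of scipy.ndimage.zoom for re-scaling are assumptions; the patent does not prescribe a particular conversion or interpolation.

```python
import numpy as np
from scipy.ndimage import zoom

def boundary_box_greyscale(frame_rgb: np.ndarray, mask: np.ndarray,
                           min_area: float = 1500.0) -> np.ndarray:
    """Sub-steps 305-312 (sketch): crop a bounding box around the segment,
    convert to greyscale, and re-scale so the segment area equals m_A."""
    ys, xs = np.nonzero(mask)
    b = frame_rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # b(x, y)
    grey = b @ np.array([0.299, 0.587, 0.114])  # b'(x, y); assumed luma weights
    scale = np.sqrt(min_area / float(mask.sum()))  # area scales quadratically
    return zoom(grey, scale)                       # b''(x, y)
```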

The detection of the facial features (the eyes and the mouth) is based on two characteristics of such features. A first is that these features are darker than the rest of the face, and a second is that they give small edges after edge detection. Using these characteristics, two facial feature maps f₁(x,y) and f₂(x,y) are formed from the greyscale image b″(x,y).

The first facial feature map f₁(x,y) is formed by applying a threshold in sub-step 315 to the greyscale image b″(x,y), giving a value of “1” to pixels with intensity values lower than a predetermined value and a value of “0” to pixels with intensity values above the predetermined value. Blobs, being defined as dark regions in the first facial feature map f₁(x,y), appear where the pixels of the image have an intensity value lower than the threshold.

The second facial feature map f₂(x,y) is formed by first applying in sub-step 320 an edge detection algorithm to the greyscale image b″(x,y) (formed in sub-step 312) to create edges. Any edge detection technique may be used, but in the preferred implementation the Prewitt edge detection technique is used, which may be described as a differential gradient approach. Each of two masks M_(x) and M_(y) is convolved with the greyscale image b″(x,y), resulting in a local horizontal and a local vertical gradient magnitude, g_(x) and g_(y). The two masks M_(x) and M_(y) are given by:

$\begin{matrix}{{M_{x} = \begin{bmatrix}{- 1} & 0 & 1 \\{- 1} & 0 & 1 \\{- 1} & 0 & 1\end{bmatrix}};{and}} & (3) \\{M_{y} = \begin{bmatrix}{- 1} & {- 1} & {- 1} \\0 & 0 & 0 \\1 & 1 & 1\end{bmatrix}} & (4)\end{matrix}$

A local edge magnitude is then given by:

$\begin{matrix}{g = {\max{({{\left| g_{x} \right|},{\left| g_{y} \right|}})}}} & (5)\end{matrix}$

to which a threshold is applied in order to get a binary edge map 321. In sub-step 325 a dilation (mathematical morphology) is applied to the binary edge map 321, thereby enlarging the edges, to form facial feature map f₂(x,y).

Finally, a combined facial feature map F(x,y) is formed in sub-step 330 by calculating an average of the two facial feature maps f₁(x,y) and f₂(x,y). By averaging the two facial feature maps f₁(x,y) and f₂(x,y), a more robust feature extraction is achieved. The combined facial feature map F(x,y) is also the output of step 140.
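Sub-steps 315 to 330 might together be sketched as follows; the two threshold values are illustrative assumptions, since the patent states only that predetermined values are used.

```python
import numpy as np
from scipy.ndimage import convolve, binary_dilation

M_X = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])   # equation (3)
M_Y = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]])   # equation (4)

def combined_feature_map(grey: np.ndarray,
                         dark_thresh: float = 80.0,    # assumed value
                         edge_thresh: float = 60.0):   # assumed value
    """Sub-steps 315-330 (sketch): darkness map f1, dilated Prewitt edge
    map f2, and their average F(x, y)."""
    grey = np.asarray(grey, dtype=float)
    f1 = (grey < dark_thresh).astype(float)            # f1: dark blobs
    gx = convolve(grey, M_X)                           # local horizontal gradient
    gy = convolve(grey, M_Y)                           # local vertical gradient
    g = np.maximum(np.abs(gx), np.abs(gy))             # equation (5)
    f2 = binary_dilation(g > edge_thresh).astype(float)  # sub-step 325
    return (f1 + f2) / 2.0                             # values in {0, 0.5, 1}
```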

The combined facial feature map F(x,y) includes blobs at positions of possible facial features, with those possible facial features ideally representing the eyes and the mouth. However, the possible facial features may include facial features that are not the eyes and the mouth, with such possible facial features adding some errors. Accordingly, a selection is required to determine which of the possible facial features inside the combined facial feature map F(x,y) have a high probability of corresponding to the eyes and the mouth.

Referring again to FIG. 2, step 140 is followed by step 145 where triangles are formed from the possible facial features in the combined facial feature map F(x,y), and a probability {overscore (p)}_(k) for each triangle is calculated based on the similarity of the geometry of the triangle with the geometry of a human face model, and in particular the geometry of the eyes and the mouth.

FIG. 4 shows a flow diagram of the sub-steps of step 145. Step 145 starts in sub-step 403 where a threshold is applied to the combined facial feature map F(x,y) to produce a final facial feature map F′(x,y). The values of the combined facial feature map F(x,y) will be either 0, 0.5 or 1. Therefore, dependent on the threshold, those possible facial features appearing only in one of the two facial feature maps f₁(x,y) and f₂(x,y) will either be included or excluded in the final facial feature map F′(x,y).

In sub-step 405 a triangle is formed from the centroids of three of the possible facial features in the final facial feature map F′(x,y). For each such triangle t_(k), a probability {overscore (p)}_(k) is determined that the triangle under consideration includes the true facial features of two eyes and a mouth by evaluating a number of the characteristics of the triangle t_(k).

The angle α that the uppermost line of the triangle t_(k) makes with the horizontal is determined in sub-step 410. FIG. 8 shows an example of the triangle t_(k) formed between the centroids of three of the possible facial features and the angle α that the uppermost line of the triangle makes with the horizontal. This line is assumed to be the eyeline of the possible facial features represented by the triangle t_(k). A horizontal probability p₁ is determined in sub-step 420 from this angle α for the triangle t_(k) as follows:

$\begin{matrix}{p_{1} = e^{- 2\alpha^{2}}} & (6)\end{matrix}$

Using the angle α, the positions of all the centroids are recalculated in sub-step 415 in order to have the uppermost line of the triangle t_(k) under consideration horizontal. In sub-step 425 an eyebrow probability p₂ is then calculated for the two possible facial features assumed to be the eyes, by determining whether there are other possible facial features situated within a window above those possible eye facial features. As eyebrows have similar properties to those of eyes, the eyebrow probability p₂ of a triangle t_(k) which includes the eyes would be higher than that of one including the eyebrows. Let d be the distance between the two possible eyes (or length of the eyeline), and ξ_(i) the vector joining the possible eye facial feature j and a possible facial feature i above it. The eyebrow probability p₂ may then be calculated for each possible eye as follows:

$\begin{matrix}{p_{2j} = \left\{ \begin{matrix}1 & {{{{{{if}\mspace{14mu}\xi_{ix}} < \frac{d}{5}}\mspace{14mu}\&}\mspace{14mu}{\xi_{iy}}} < \frac{2*d}{5}} \\0 & {else}\end{matrix} \right.} & (7)\end{matrix}$

with ξ_(ix) being the vertical component of the vector ξ_(i) and ξ_(iy) being the horizontal component of the vector ξ_(i). This test is repeated for each facial feature i above a possible eye facial feature j, and a probability of one is given to the eyebrow probability p_(2j) of possible facial feature j if at least one facial feature i gives a probability of one.

The eyebrow probability for the triangle t_(k) is given by:

$\begin{matrix}{p_{2} = {\frac{1}{2}{\sum\limits_{j = 1}^{2}p_{2j}}}} & (8)\end{matrix}$

Considering the angles α₁ and α₂ formed at the possible eye facial features of the triangle t_(k), experimental results have shown that these angles α₁ and α₂ range between 0.7 and 1.4 radians for triangles of real faces. These angles α₁ and α₂ are also illustrated in FIG. 8. An angle probability p₃ is determined in sub-step 430 from the angles α₁ and α₂ of the triangle t_(k) as follows:

$\begin{matrix}{{p_{\alpha_{j}}\left( \alpha_{j} \right)} = \left\{ \begin{matrix}0 & {{{if}\mspace{14mu}{\alpha_{j} \leq 0.5}}\mspace{14mu}{or}\mspace{14mu}{\alpha_{j} \geq 1.7}} \\{{2.5*\alpha_{j}} - 1.25} & {{{if}\mspace{14mu} 0.5} < \alpha_{j} < 0.9} \\1 & {{{if}\mspace{14mu} 0.9} \leq \alpha_{j} \leq 1.3} \\{{{- 2.5}*\alpha_{j}} + 4.25} & {{{if}\mspace{14mu} 1.3} < \alpha_{j} < 1.7}\end{matrix} \right.} & (9) \\{{{with}\mspace{20mu} p_{3}} = \frac{p_{\alpha_{1}} + p_{\alpha_{2}}}{2}} & (10)\end{matrix}$
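A direct transcription of equations (9) and (10) as a sketch, assuming the boundary values 0.9 and 1.3 are included in the plateau:

```python
def angle_probability(alpha_1: float, alpha_2: float) -> float:
    """Equations (9)-(10) (sketch): trapezoidal score for the two eye
    angles, peaking at 1 between 0.9 and 1.3 radians."""
    def p_alpha(a: float) -> float:
        if a <= 0.5 or a >= 1.7:
            return 0.0
        if a < 0.9:
            return 2.5 * a - 1.25     # rising edge
        if a <= 1.3:
            return 1.0                # plateau of plausible face angles
        return -2.5 * a + 4.25        # falling edge
    return (p_alpha(alpha_1) + p_alpha(alpha_2)) / 2.0  # p_3
```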

Ideally, a facial feature should be detected in each of the facial feature maps f₁(x,y) and f₂(x,y) (FIG. 3). Therefore, the mean of the facial feature maps f₁(x,y) and f₂(x,y), which was determined in sub-step 330 (FIG. 3) as the combined facial feature map F(x,y), provides another measure of whether the facial features of the triangle t_(k) are true facial features. Therefore a segment probability p₄ calculated in sub-step 435 is given by:

$\begin{matrix}{p_{4j} = {\frac{1}{\eta_{j}}{\sum\limits_{l = 1}^{\eta_{j}}\rho_{l}}}} & (11) \\{{{and}\mspace{14mu} p_{4}} = {\frac{1}{3}{\sum\limits_{j = 1}^{3}p_{4j}}}} & (12)\end{matrix}$

η_(j): number of pixels in the facial feature segment j.

ρ_(l): grey value of pixel l of possible facial feature j from the combined facial feature map F(x,y).

The relative position of the triangle t_(k) inside the boundary box sub-image b(x,y) gives a position probability p₅, which is calculated in sub-step 440. Experimental results have shown that the eyes are most often situated at the top of such a boundary box sub-image b(x,y). Position probability p₅ takes into consideration the position of the highest of the possible eyes in probability p₅₁ and the distance between the possible eyes in relation to the width of the boundary box sub-image b(x,y) in probability p₅₂ as follows:

$\begin{matrix}{p_{51} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} X_{c}} < \frac{2*Y}{3}} \\{{- \frac{3*X_{c}}{Y}} + 3} & {elsewhere}\end{matrix} \right.} & (13) \\{p_{52} = \left\{ \begin{matrix}1 & {{{{{{if}\mspace{14mu} e} < \frac{Y}{2}}\mspace{14mu}\&}\mspace{14mu} e} > \frac{Y}{4}} \\{\frac{{- 4}*e}{Y} + 3} & {{{{{{if}\mspace{14mu} e} > \frac{Y}{2}}\mspace{14mu}\&}\mspace{14mu} e} < \frac{3*Y}{4}} \\\frac{4*e}{Y} & {{{if}\mspace{14mu} e} < \frac{Y}{4}} \\0 & {else}\end{matrix} \right.} & (14) \\{p_{5} = \frac{p_{51} + p_{52}}{2}} & (15)\end{matrix}$

with

X_(c) is the x-axis coordinate of the highest possible eye facial feature of the triangle t_(k);

Y is the width of the boundary box sub-image b(x,y); and

e is the distance between the two possible eye facial features.

The probability {overscore (p)}_(k) of a triangle t_(k) being a true facial feature triangle is calculated and stored in sub-step 450, and is given by:

$\begin{matrix}{{\overset{\_}{p}}_{k} = {\sum\limits_{l = 1}^{5}{\pi_{l}*p_{l}}}} & (16)\end{matrix}$

with

π_(l) being predetermined probability weight factors

$\left( {{\sum\limits_{l}\pi_{l}} = 1} \right)$

In the preferred implementation, the predetermined probability weight factors are π₁=0.2, π₂=0.1, π₃=0.1, π₄=0.5, and π₅=0.1.
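Equation (16) is then a weighted sum of the five cues, as in this sketch using the preferred weights:

```python
# Predetermined probability weight factors pi_1 .. pi_5 (they sum to 1).
WEIGHTS = (0.2, 0.1, 0.1, 0.5, 0.1)

def triangle_probability(p1: float, p2: float, p3: float,
                         p4: float, p5: float) -> float:
    """Equation (16) (sketch): weighted combination of the five cues into
    the probability that triangle t_k is a true facial feature triangle."""
    return sum(w * p for w, p in zip(WEIGHTS, (p1, p2, p3, p4, p5)))
```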

Sub-step 460 determines whether there are more triangles t_(k) to be considered. If there are more triangles t_(k), sub-steps 405 to 455 are repeated for a next triangle t_(k). Alternatively, with all the triangles t_(k) in the 2D segment s_(t) ^(i) having a probability {overscore (p)}_(k) of being a facial feature triangle assigned, a predetermined number of those triangles t_(k) that have the highest probabilities {overscore (p)}_(k) are stored in sub-step 470, after which step 145 ends. In particular, the angles α_(1k) and α_(2k) formed between the “eyeline” and the other two sides (from left to right) and the positions (x_(1k),y_(1k)), (x_(2k),y_(2k)) and (x_(3k),y_(3k)) of the three centroids forming the corners of the triangle t_(k) (starting from the upper left corner and proceeding clockwise) are stored.

Referring again to FIG. 2, step 145 is followed by step 150 where the processor 705 uses the stored properties of the triangles t_(k) to match the triangles t_(k) with a feature triangle T_(j) of the 3D segment S_(i), and to accumulate the probabilities {overscore (p)}_(k) in subsequent frames into an accumulated probability P_(j) for each such feature triangle T_(j), until a sufficiently strong accumulated probability P_(j) is available on which to make robust classification decisions. In the preferred implementation, a maximum number N_(max) of feature triangles to be stored is set to 10. The 3D segmentation (step 115) allows this accumulation of the probabilities {overscore (p)}_(k) to occur over the temporal history of each segment S_(i).

FIG. 5 shows a flow diagram of the sub-steps of step 150. Step 150 starts in sub-step 505 where a next stored triangle t_(k) is retrieved from the storage device 709. Sub-step 510 determines whether it is the first time that 3D segment S_(i), of which segment s_(t) ^(i) is an intersection, is being analysed. If sub-step 510 determines that it is the first time that 3D segment S_(i) is being analysed, then the properties of that triangle t_(k) are retained in sub-step 515 as feature triangle T_(k) and the accumulated probability P_(k) of feature triangle T_(k) is set to that of triangle t_(k), i.e. {overscore (p)}_(k). Step 150 then proceeds to sub-step 550 where it is determined whether there are any more stored triangles t_(k) that are to be evaluated by step 150.

If sub-step 510 determines that it is not the first time that 3D segment S_(i) is being analysed, then sub-step 520 evaluates a distance D(T_(l),t_(k)) between the triangle t_(k) under consideration and each of the feature triangles T_(l). The distance D(T_(l),t_(k)) is given by:

$\begin{matrix}{{D\left( {T_{l},t_{k}} \right)} = {{\sum\limits_{c = 1}^{3}\left( {x_{cl} - x_{ck}} \right)^{2}} + {\sum\limits_{c = 1}^{3}\left( {y_{cl} - y_{ck}} \right)^{2}} + {\sum\limits_{c = 1}^{2}\left( {\alpha_{cl} - \alpha_{ck}} \right)^{2}}}} & (17)\end{matrix}$
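A sketch of equation (17), assuming each triangle is held as its three stored centroids plus its two stored angles (the dict representation is an assumption made for the example):

```python
def triangle_distance(T: dict, t: dict) -> float:
    """Equation (17) (sketch): squared distance between a stored feature
    triangle T and a candidate triangle t. Each triangle is a dict with
    'corners' (three (x, y) centroids) and 'angles' (alpha_1, alpha_2)."""
    d = sum((xl - xk) ** 2 + (yl - yk) ** 2
            for (xl, yl), (xk, yk) in zip(T["corners"], t["corners"]))
    d += sum((al - ak) ** 2 for al, ak in zip(T["angles"], t["angles"]))
    return d
```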

Sub-step 525 then determines whether the lowest of the distances D(T_(l),t_(k)) is lower than a predetermined threshold. If the lowest of the distances D(T_(l),t_(k)) is lower than the predetermined threshold, then the triangle t_(k) under consideration is sufficiently similar to feature triangle T_(l) to be that feature triangle in the current frame. Accordingly, in sub-step 530 that feature triangle T_(l) is set to be triangle t_(k) and its accumulated probability P_(l) is set to:

$\begin{matrix}{P_{l} = \frac{\frac{P_{l} + {\overset{\_}{p}}_{k}}{2} + \frac{{Nd}_{l}}{T\_ life} + \frac{{Nd}_{l}}{S\_ life}}{3}} & (18)\end{matrix}$

wherein Nd_(l) is the number of times the feature triangle T_(l) has been detected through time;

T_life is the lifetime of the feature triangle T_(l), which is the difference between the current frame index t and the frame index of the frame where feature triangle T_(l) first appeared; and

S_life is the lifetime of the segment S_(i) under consideration, which is the difference between the frame index t and the frame index of the frame where 3D segment S_(i) first had at least one feature triangle T_(k) detected.

If sub-step 525 determines that the lowest of the distances D(T_(l),t_(k)) is not lower than the predetermined threshold, then the triangle t_(k) under consideration is not sufficiently similar to any one of the feature triangles T_(l). If sub-step 534 determines that the number of stored feature triangles is lower than the number N_(max), the properties of that triangle t_(k) are then retained in sub-step 535 as feature triangle T_(N+1), and the accumulated probability P_(N+1) of feature triangle T_(N+1) is set to that of triangle t_(k), i.e. {overscore (p)}_(k). From sub-step 534 or sub-step 535, step 150 proceeds to sub-step 550 where it is determined whether there are any more stored triangles t_(k) that have to be evaluated by step 150. If more triangles t_(k) remain to be evaluated, then step 150 returns to sub-step 505.

If all triangles t_(k) have been processed, then sub-step 552 recalculates the accumulated probabilities P_(l) of the feature triangles T_(l) that have not been found similar to any of the triangles t_(k) as:

$\begin{matrix}{P_{l} = \frac{\frac{P_{l}}{2} + \frac{{Nd}_{l}}{T\_ life} + \frac{{Nd}_{l}}{S\_ life}}{2}} & (19)\end{matrix}$

Step 150 then proceeds to sub-step 555 where a number of the feature triangles T_(l) are discarded. In particular, if the accumulated probability P_(l) of a feature triangle T_(l) becomes lower than a predetermined threshold, such as 0.5, and the lifetime T_life of the feature triangle T_(l) is higher than another predetermined threshold, such as 6, then such a feature triangle T_(l) is discarded.
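The update and discard rules of equations (18) and (19) and sub-step 555 might be sketched as follows; the lifetimes are assumed to be at least one frame so that the divisions are defined:

```python
def update_matched(P_l: float, p_k: float, Nd_l: int,
                   T_life: int, S_life: int) -> float:
    """Equation (18) (sketch): accumulated probability of a feature
    triangle T_l matched by a candidate triangle with probability p_k."""
    return ((P_l + p_k) / 2 + Nd_l / T_life + Nd_l / S_life) / 3

def update_unmatched(P_l: float, Nd_l: int,
                     T_life: int, S_life: int) -> float:
    """Equation (19) (sketch): decay applied when T_l found no match."""
    return (P_l / 2 + Nd_l / T_life + Nd_l / S_life) / 2

def should_discard(P_l: float, T_life: int,
                   p_thresh: float = 0.5, life_thresh: int = 6) -> bool:
    """Sub-step 555 (sketch): drop long-lived, low-probability triangles."""
    return P_l < p_thresh and T_life > life_thresh
```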

Referring again to FIG. 2, method 100 uses the accumulated probabilities P_(l) of the feature triangles T_(l) in step 155 for classifying the 2D segment s_(t) ^(i), and also its associated 3D segment S_(i), as a human face if at least one of the accumulated probabilities P_(l) is higher than a predetermined threshold, such as 0.75, and the lifetime T_life of that feature triangle T_(l) is higher than another predetermined threshold, such as 3. 3D segment S_(i) is assigned a face label and will retain that label for the duration of that segment S_(i).

Next, step 160 determines whether there are any more unlabeled segments s_(t) ^(i) that have not been evaluated. If there are unlabeled segments s_(t) ^(i) yet to be evaluated, then method 100 returns to step 125 where a next unlabeled segment s_(t) ^(i) is selected for evaluation. If all unlabeled segments s_(t) ^(i) have been evaluated, then step 165 determines whether there are more frames in the video sequence. If more frames exist, then method 100 proceeds to step 170 where the frame index n is incremented before method 100 proceeds to step 105 where the two dimensional array of pixel data of the next frame is received. Alternatively, method 100 ends in step 175.

3D Segmentation

The spatiotemporal (3D) segmentation of the video data based on colour performed by the processor 705 in step 115 (FIG. 2) will now be described in more detail. The segmentation step 115 segments the three-dimensional block of pixel data of the L+1 most recently received frames into a set of three-dimensional segments {S_(i)}, so that every pixel in the block is related to one segment S_(i), in which all pixels belonging to the same segment S_(i) have homogeneous colour values φ(x,y,n).

An assumption underlying the segmentation problem is that each colour value φ(x,y,n) is associated with a particular state. The model used to define the states is decided upon in advance. Each state is defined by an unknown segment model parameter vector {overscore (α)}_(i) of length c, with each state being assumed to be valid over the contiguous 3D segment S_(i). The aim of segmentation is to identify these 3D segments S_(i) and the model parameters {overscore (α)}_(i) for each segment S_(i).

A model vector of measurements γ(x,y,n) over each segment S_(i) is assumed to be a linear projection of the c-vector model parameter {overscore (α)}_(i) for that segment S_(i):

$\begin{matrix}{{{\gamma{({x,y,n})}} = {{A{({x,y,n})}}{\overset{\_}{\alpha}}_{i}}},\mspace{14mu}{{({x,y,n})} \in S_{i}}} & (20)\end{matrix}$

where A(x,y,n) is an m by c matrix, which relates the state of segment S_(i) to the model measurements γ(x,y,n), thereby encapsulating the nature of the predefined model. In the colour video segmentation case, c=m and matrix A(x,y,n) is the c by c identity matrix for all (x,y,n).

Each vector of actual colour values φ(x,y,n) is subject to a random error e(x,y,n) such that

$\begin{matrix}{{\phi{({x,y,n})}} = {{\gamma{({x,y,n})}} + {e{({x,y,n})}}}} & (21)\end{matrix}$

Further, the error e(x,y,n) may be assumed to be drawn from a zero-mean normal (Gaussian) distribution with covariance Λ(x,y,n):

$\begin{matrix}{{e{({x,y,n})}} \sim {N{({0,{\Lambda{({x,y,n})}}})}}} & (22)\end{matrix}$

wherein Λ(x,y,n) is a c by c covariance matrix. Each component of the error e(x,y,n) is assumed to be independently and identically distributed, i.e.:

$\begin{matrix}{{\Lambda{({x,y,n})}} = {{\sigma^{2}{({x,y,n})}}I_{c}}} & (23)\end{matrix}$

where I_(c) is the c by c identity matrix.

Variational segmentation requires that a cost function E be assigned to each possible segmentation. The cost function E used in the preferred implementation is one in which a model fitting error is balanced with an overall complexity of the model. The sum of the statistical residuals of each segment S_(i) is used as the model fitting error. Combining Equations (20), (21), (22) and (23), the residual over segment S_(i) as a function of the model parameters α_(i) is given by

$\begin{matrix}{{E_{i}\left( \alpha_{i} \right)} = {\sum\limits_{{({x,y,n})} \in S_{i}}\;{\left\lbrack {{\phi\left( {x,y,n} \right)} - \alpha_{i}} \right\rbrack^{T}\left\lbrack {{\phi\left( {x,y,n} \right)} - \alpha_{i}} \right\rbrack}}} & (24)\end{matrix}$

A partition into segments S_(i) may be compactly described by a binary function J(d), in which the value one (1) is assigned to each boundary pixel bordering a segment S_(i). This function J(d) is referred to as a boundary map. The model complexity is simply the number of segment-bounding elements d. Hence the overall cost functional E may be defined as

$\begin{matrix}{{{E\left( {\gamma,J,\lambda} \right)} = {{\sum\limits_{i}\;{E_{i}\left( \alpha_{i} \right)}} + {\lambda{\sum\limits_{d}{J(d)}}}}},} & (25)\end{matrix}$

where the (non-negative) parameter λ controls the relative importance of model fitting error and model complexity. The contribution of the model fitting error to the cost functional E encourages a proliferation of segments, while the model complexity encourages few segments. The functional E must therefore balance the two components to achieve a reasonable result. The aim of variational segmentation is to find a minimising model vector {overscore (γ)} and a minimising boundary map {overscore (J)}(d) of the overall cost functional E, for a given parameter λ value.

Note that if the segment boundaries d are given as a valid boundary map J(d), the minimising model parameters {overscore (α)}_(i) over each segment S_(i) may be found by minimising the segment residuals E_(i). This may be evaluated using a simple weighted linear least squares calculation. Given this fact, any valid boundary map J(d) will fully and uniquely describe a segmentation. Therefore, the cost function E may be regarded as a function over the space of valid edge maps (J-space), whose minimisation yields an optimal segment partition {overscore (J)}_(λ) for a given parameter λ. The corresponding minimising model parameters {overscore (α)}_(i) may then be assumed to be those which minimise the residuals E_(i) over each segment S_(i). The corresponding minimum residuals for segment S_(i) will hereafter be written as Ē_(i).

If parameter λ is low, many boundaries are allowed, giving “fine” segmentation. As parameter λ increases, the segmentation gets coarser. At one extreme, the optimal segment partition {overscore (J)}₀, where the model complexity is completely discounted, is the trivial segmentation, in which every pixel constitutes its own segment S_(i), and which gives zero model fitting error e. At the other extreme, the optimal segment partition where the model fitting error e is completely discounted is the null or empty segmentation, in which the entire block is represented by a single segment S_(i). Somewhere between these two extremes lies the segmentation {overscore (J)}_(λ), which will appear ideal in that the segments S_(i) correspond to a semantically meaningful partition.

To find an approximate solution to the variational segmentation problem, a segment merging strategy has been employed, wherein properties of neighbouring segments S_(i) and S_(j) are used to determine whether those segments come from the same model state, thus allowing the segments S_(i) and S_(j) to be merged into a single segment S_(ij). The segment residual E_(ij) also increases after any two neighbouring segments S_(i) and S_(j) are merged.

Knowing that the trivial segmentation is the optimal segment partition J_(λ) for the smallest possible parameter λ value of 0, in segment merging each voxel in the block is initially labelled as its own unique segment S_(i), with model parameters set to the colour values φ(x,y,n). Adjacent segments S_(i) and S_(j) are then compared using some similarity criterion and merged if they are sufficiently similar. In this way, small segments take shape, and are gradually built into larger ones.

The segmentations {overscore (J)}_(λ) before and after the merger differ only in the two segments S_(i) and S_(j). Accordingly, in determining the effect on the total cost functional E after such a merger, the computation may be confined to those segments S_(i) and S_(j). By examining Equations (24) and (25), a merging cost for the adjacent segment pair {S_(i),S_(j)} may be written as

$\begin{matrix}{\tau_{ij} = \frac{{\overset{\_}{E}}_{ij} - \left( {{\overset{\_}{E}}_{i} + {\overset{\_}{E}}_{j}} \right)}{l\left( \delta_{ij} \right)}} & (26)\end{matrix}$

where l(δ_(ij)) is the area of the common boundary between three-dimensional segments S_(i) and S_(j). If the merging cost τ_(ij) is less than parameter λ, the merge is allowed.

The key to efficient segment growing is to compute the numerator of the merging cost τ_(ij) as fast as possible. Firstly, Equation (24) is rewritten as:

$\begin{matrix}{{E_{j}{(\alpha_{j})}} = {{({F_{j} - {H_{j}\alpha_{j}}})}^{T}{({F_{j} - {H_{j}\alpha_{j}}})}}} & (27)\end{matrix}$

where:

H_(j) is a (v_(j)c) by c matrix composed of the c by c identity matrices stacked on top of one another as (x,y,n) varies over segment S_(j), with v_(j) the number of voxels in segment S_(j); and

F_(j) is a column vector of length (v_(j)c) composed of the individual colour value φ(x,y,n) vectors stacked on top of one another.

By weighted least squares theory, the minimising model parameter vector {overscore (α)}_(j) for the segment S_(j) is given by the mean of the colour values φ(x,y,n) over segment S_(j).

Let κ_(j) be the confidence in the model parameter estimate {overscore (α)}_(j), defined as the inverse of its covariance:

$\begin{matrix}{\kappa_{j} = {\Lambda_{j}^{- 1}} = {H_{j}^{T}H_{j}}} & (28)\end{matrix}$

which simply evaluates to v_(j)I_(c). The corresponding residual is given by

$\begin{matrix}{{\overset{\_}{E}}_{j} = {{({F_{j} - {H_{j}{\overset{\_}{\alpha}}_{j}}})}^{T}{({F_{j} - {H_{j}{\overset{\_}{\alpha}}_{j}}})}}} & (29)\end{matrix}$

When merging two segments S_(i) and S_(j), the “merged” matrix H_(ij) is obtained by concatenating matrix H_(i) with matrix H_(j); likewise for matrix F_(ij). These facts may be used to show that the best fitting model parameter vector {overscore (α)}_(ij) for the merged segment S_(ij) is given by:

$\begin{matrix}{{\overset{\_}{\alpha}}_{ij} = \frac{{v_{i}{\overset{\_}{\alpha}}_{i}} + {v_{j}{\overset{\_}{\alpha}}_{j}}}{v_{i} + v_{j}}} & (30)\end{matrix}$

and the merged confidence is:

$\begin{matrix}{\kappa_{ij} = {\kappa_{i} + \kappa_{j}} = {v_{ij}I_{c}}} & (31)\end{matrix}$

The merged residual is given by:

$\begin{matrix}{{\overset{\_}{E}}_{ij} = {{\overset{\_}{E}}_{i} + {\overset{\_}{E}}_{j} + {\left( {{\overset{\_}{\alpha}}_{i} - {\overset{\_}{\alpha}}_{j}} \right)^{T}{\left( {{\overset{\_}{\alpha}}_{i} - {\overset{\_}{\alpha}}_{j}} \right)}\frac{v_{i}v_{j}}{v_{i} + v_{j}}}} & (32)\end{matrix}$

The merging cost τ_(ij) in Equation (26) may be computed as:

$\begin{matrix}{\tau_{ij} = \frac{{{{\overset{\_}{a}}_{i} - {\overset{\_}{a}}_{j}}}^{2}\frac{v_{i}v_{j}}{v_{i} + v_{j}}}{l\left( \delta_{ij} \right)}} & (33)\end{matrix}$

from the model parameters and confidences of the segments S_(i) and S_(j) to be merged. If the merge is allowed, Equations (30) and (31) give the model parameter {overscore (α)}_(ij) and confidence κ_(ij) of the merged segment S_(ij).
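Equations (30), (31) and (33) might be sketched as follows; only the per-segment mean colours and voxel counts are needed, as the text notes:

```python
import numpy as np

def merging_cost(alpha_i, alpha_j, v_i, v_j, boundary_area):
    """Equation (33) (sketch): cost of merging segments S_i and S_j from
    their mean colours (length-c vectors), voxel counts, and the area
    l(delta_ij) of their common boundary."""
    diff = np.asarray(alpha_i, dtype=float) - np.asarray(alpha_j, dtype=float)
    return float(diff @ diff) * (v_i * v_j / (v_i + v_j)) / boundary_area

def merge_parameters(alpha_i, alpha_j, v_i, v_j):
    """Equations (30)-(31) (sketch): parameters of the merged segment S_ij."""
    v_ij = v_i + v_j
    alpha_ij = (v_i * np.asarray(alpha_i) + v_j * np.asarray(alpha_j)) / v_ij
    return alpha_ij, v_ij  # confidence is v_ij * I_c, so v_ij suffices
```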

During segment-merging segmentation, the merging of segments must stop once the merging cost τ_(ij) exceeds a predetermined threshold λ_(stop). Note that under this strategy, only Equations (30), (31), and (33) need to be applied throughout the merging process. Only the model parameters {overscore (α)}_(i) and their confidences κ_(i) for each segment S_(i) are therefore required as segmentation proceeds. Further, neither the original colour values φ(x,y,n) nor the model structure itself (i.e. the matrices A(x,y,n)) are required.

FIG. 7 shows the 3D segmentation step 115 (FIG. 2) in more detail. The 3D segmentation step 115 starts in sub-step 804, which sets the model parameters {overscore (α)}(x,y,n) to the colour values φ(x,y,n), and the model confidences κ(x,y,n) to the identity matrix I_(c) for each voxel in the block of L+1 frames. The 3D segmentation starts with the trivial segmentation where each voxel forms its own segment S_(i). Sub-step 806 then determines all adjacent segment pairs S_(i) and S_(j), and computes the merging cost τ_(ij) according to Equation (33) for each of the boundaries between adjacent segment pairs S_(i) and S_(j). Sub-step 808 inserts the boundaries with merging cost τ_(ij) into a priority queue Q in priority order.

Sub-step 810 takes the first entry from the priority queue Q(1) and merges the corresponding segment pair S_(i) and S_(j) (i.e. the segment pair S_(i) and S_(j) with the lowest merging cost τ_(ij)) to form a new segment S_(ij).

Sub-step 814 identifies all boundaries between segments S_(l) adjoining either of the merged segments S_(i) and S_(j), and merges any duplicate boundaries, adding their areas. Sub-step 818 follows, where the processor 705 calculates a new merging cost τ_(ij,l) for each boundary between adjacent segments S_(ij) and S_(l). The new merging costs τ_(ij,l) effectively reorder the priority queue Q into the final sorted queue in sub-step 818.

Sub-step 818 passes control to sub-step 822 where the processor 705 determines whether the merging cost τ_(ij) corresponding to the segments S_(i) and S_(j) at the top of the priority queue Q (entry Q(1)) has a value greater than a predetermined threshold λ_(stop), which signifies the stopping point of the merging. If the merging has reached the stopping point, then the 3D segmentation step 115 ends. Alternatively, control is returned to sub-step 810, from where sub-steps 810 to 822 are repeated, merging the two segments with the lowest merging cost τ_(ij) every cycle, until the stopping point is reached.
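A minimal sketch of the merging loop of sub-steps 806 to 822, reusing merging_cost and merge_parameters from the previous sketch. For brevity it rescans the boundary set for the cheapest pair each cycle rather than maintaining the priority queue Q that the patent uses for efficiency; the merges performed are the same:

```python
def merge_segments(segments, boundaries, lam_stop):
    """Greedy merging (sketch). segments: {seg_id: (alpha, v)} mean colour
    and voxel count; boundaries: {frozenset({i, j}): area} of common
    boundaries between adjacent segments."""
    while boundaries:
        def cost(pair):
            i, j = tuple(pair)
            return merging_cost(segments[i][0], segments[j][0],
                                segments[i][1], segments[j][1],
                                boundaries[pair])
        best = min(boundaries, key=cost)   # entry Q(1): lowest merging cost
        if cost(best) > lam_stop:
            break                          # stopping point lambda_stop reached
        i, j = tuple(best)
        segments[i] = merge_parameters(segments[i][0], segments[j][0],
                                       segments[i][1], segments[j][1])
        del segments[j]
        del boundaries[best]
        # Sub-step 814: redirect j's boundaries to i, merging duplicates
        # by adding their areas.
        for pair in list(boundaries):
            if j in pair:
                other = next(iter(pair - {j}))
                area = boundaries.pop(pair)
                new_pair = frozenset({i, other})
                boundaries[new_pair] = boundaries.get(new_pair, 0.0) + area
    return segments
```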

Referring again to FIG. 6, as noted previously, when frame data of a new frame is received in step 105 (FIG. 2), the new frame is added to the window 600, while the oldest frame in the window 600 is removed from the block of pixel data. The 3D segmentation step 115 is performed as each new frame is received in step 105. However, after the 3D segmentation step 115 described with reference to FIG. 7 has been performed a first time, in subsequent executions of the 3D segmentation step 115 the segments S_(i) formed in a previous segmentation are maintained in sub-step 804, with only the model parameters {overscore (α)}(x,y,n) and model confidences κ(x,y,n) of the new frame being set to the colour values φ(x,y,n) and the identity matrix I_(c) respectively. The effect of the 3D segmentation step 115 is thus to merge the unsegmented pixels of the new frame into the existing segments S_(i) from a previous segmentation. Those existing segments S_(i) from a previous segmentation may adjust due to the information contained in the new frame.

Segment Pre-Filtering

Step 130 (FIG. 2), which determines whether the segment s_(t) ^(i) satisfies a number of pre-filtering criteria so that segments s_(t) ^(i) that are likely not to correspond to a human face may be omitted from further processing, will now be described in more detail. FIG. 9 shows a flow diagram of the sub-steps of step 130.

In the preferred implementation the optional pre-filtering criteria include whether the segment s_(t) ^(i) selected in step 125 has an elliptical shape, whether the segment s_(t) ^(i) has the colour of skin, and whether or not the segment s_(t) ^(i) moves. Any number of the pre-filtering criteria may be pre-selected by the user of the method 100.

Typically, the head of a person can be modelled as an ellipse, with a ratio of 1.2 to 1.4 between the two principal axes of such an ellipse. Step 130 starts by determining in sub-step 905 whether an elliptical pre-filter has been pre-selected. If the elliptical pre-filter has been pre-selected, processor 705 determines whether the segment s_(t) ^(i) selected in step 125 has an elliptical shape. In particular, in sub-step 910 the processor 705 calculates estimates of the compactness and the eccentricity of the 2D segment s_(t) ^(i), with the compactness being the ratio of the perimeter of segment s_(t) ^(i) against the area of segment s_(t) ^(i), and the eccentricity being the ratio of the width of segment s_(t) ^(i) against the height of segment s_(t) ^(i). The processor 705 then determines in sub-step 915 whether the compactness and the eccentricity of the segment s_(t) ^(i) fall within predefined ranges. If either the compactness or the eccentricity of the segment s_(t) ^(i) does not fall within the predefined ranges, then the segment s_(t) ^(i) is not elliptical and is therefore not considered any further. Step 130 ends and method 100 (FIG. 2) continues to step 160 (FIG. 2).
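The ellipse test of sub-steps 910 and 915 might be sketched as follows; the acceptance ranges and the 4-neighbour perimeter estimate are assumptions, as the patent states only that predefined ranges are used.

```python
import numpy as np

def passes_ellipse_filter(mask: np.ndarray,
                          compactness_range=(0.01, 0.05),   # assumed range
                          eccentricity_range=(0.6, 1.0)):   # assumed range
    """Sub-steps 910-915 (sketch): compactness and eccentricity tests on a
    boolean segment mask."""
    ys, xs = np.nonzero(mask)
    area = float(mask.sum())
    # Crude perimeter estimate: segment pixels with a background 4-neighbour.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:]) & mask
    perimeter = float((mask & ~interior).sum())
    compactness = perimeter / area                   # perimeter-to-area ratio
    eccentricity = (xs.max() - xs.min() + 1) / (ys.max() - ys.min() + 1)
    return (compactness_range[0] <= compactness <= compactness_range[1] and
            eccentricity_range[0] <= eccentricity <= eccentricity_range[1])
```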

Another property of a human face is that the colour of human skin is distinctive from the colour of many other natural objects. By analysing skin colour statistics, one observes that human skin colour is distributed over a small area in the chrominance plane. Furthermore, colour is orientation invariant under certain lighting conditions, and robust under partial occlusion, rotation, scale changes and resolution changes. Accordingly, if the segment s_(t) ^(i) is determined to be elliptical, or from sub-step 905 if the elliptical pre-filter has not been pre-selected, it is then determined in sub-step 918 whether a skin colour pre-filter has been pre-selected. If the skin colour pre-filter has been pre-selected, then it is determined whether the segment s_(t) ^(i) has the colour of skin.

Sub-step 920 calculates the Mahalanobis distance between the average colour value of the segment s_(t) ^(i) in the predefined colour space (after step 110) and a predetermined skin colour model. The predetermined skin colour model is created by extracting colour values from skin pixels from several images that contain faces. A mean μ and covariance matrix Σ of the colour values are calculated, thereby obtaining statistical measures representing those colour values. It is noted that all or a sub-group of the components of the colour space may be used in sub-step 920. For example, when using the CIE Luv colour space, all three Luv components may be used, or alternatively, the luminance L component may be ignored.

With z_(i) being the average colour value of segment s_(t) ^(i), the Mahalanobis distance D_(M)(z_(i)) for segment s_(t) ^(i) is defined as:

$\begin{matrix}{{D_{M}\left( z_{i} \right)} = {\left( {z_{i} - \mu} \right)^{T}{\Sigma^{- 1}\left( {z_{i} - \mu} \right)}}} & (34)\end{matrix}$

Values for the Mahalanobis distance D_(M)(z_(i)) vary between zero and infinity. A membership function Mf is used to transfer the Mahalanobis distance D_(M)(z_(i)) to a skin probability as follows:

$\begin{matrix}\left\{ \begin{matrix}{{{Mf}\left( {D_{M}\left( z_{i} \right)} \right)} = 1} & {if} & {{D_{M}\left( z_{i} \right)} \leq {{val}\; 1}} \\{{{Mf}\left( {D_{M}\left( z_{i} \right)} \right)} = 0} & {if} & {{D_{M}\left( z_{i} \right)} \geq {{val}\; 2}} \\{{{Mf}\left( {D_{M}\left( z_{i} \right)} \right)} = \frac{{D_{M}\left( z_{i} \right)} - {{val}\; 2}}{{{val}\; 1} - {{val}\; 2}}} & {if} & {{{val}\; 1} < {D_{M}\left( z_{i} \right)} < {{val}\; 2}}\end{matrix} \right. & (35)\end{matrix}$

with val1 and val2 being predetermined values. In the preferred implementation the predetermined values are val1=2 and val2=2.5.
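Equations (34) and (35) together give a skin probability, as in this sketch:

```python
import numpy as np

def skin_probability(z, mu, sigma, val1=2.0, val2=2.5):
    """Equations (34)-(35) (sketch): Mahalanobis distance of a segment's
    mean colour z from the skin model (mu, sigma), mapped to a probability
    by the membership function Mf."""
    diff = np.asarray(z, dtype=float) - np.asarray(mu, dtype=float)
    d_m = float(diff @ np.linalg.inv(sigma) @ diff)  # equation (34)
    if d_m <= val1:
        return 1.0
    if d_m >= val2:
        return 0.0
    return (d_m - val2) / (val1 - val2)              # linear ramp of Mf
```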

Sub-step 925 determines whether the skin probability is above a predetermined threshold. If the skin probability is below the threshold, then the segment s_(t) ^(i) is not skin coloured and is therefore not considered any further. Step 130 ends and method 100 (FIG. 2) continues to step 160 (FIG. 2).

Yet another observation is that most non-moving segments s_(t) ^(i) belong to the background and therefore have a low probability of containing a human face. Accordingly, if sub-step 925 determines that the segment s_(t) ^(i) is skin coloured, or if sub-step 918 determines that the skin-colour pre-filter has not been pre-selected, then the processor 705 determines in sub-step 928 whether or not a movement pre-filter has been pre-selected. If the movement pre-filter has been pre-selected, then it is determined whether or not the segment s_(t) ^(i) moves. Any technique may be used in order to decide whether or not a segment s_(t) ^(i) moves. In a specific implementation, a static camera 750 is assumed, and sub-step 930 determines whether the centroid of the segment s_(t) ^(i) moved more than a predetermined number of pixels, such as 10. If sub-step 930 determines that the centroid of the segment s_(t) ^(i) did not move more than the predetermined number of pixels, then the segment s_(t) ^(i) is deemed to be background and is therefore not considered any further. Step 130 ends and method 100 (FIG. 2) continues to step 160 (FIG. 2).

Alternatively, if sub-step 930 determines that the centroid of the segment s_(t) ^(i) did move more than the predetermined number of pixels, or sub-step 928 determined that the movement pre-filter has not been pre-selected, then step 130 ends and method 100 (FIG. 2) continues to step 140 (FIG. 2).

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

1. A method of detecting and tracking human faces across a sequence of video frames, said method comprising the steps of: (a) forming a 3D pixel data block from said sequence of video frames; (b) segmenting said 3D data block into a set of 3D segments using 3D spatiotemporal segmentation; (c) forming 2D segments from an intersection of said 3D segments with a view plane, each 2D segment being associated with one 3D segment; (d) in at least one of said 2D segments, extracting features and grouping said features into one or more groups of features; (e) for each group of features, computing a probability that said group of features represents human facial features based on the similarity of the geometry of said group of features with the geometry of a human face model; (f) matching at least one group of features with a group of features in a previous 2D segment and computing an accumulated probability that said group of features represents human facial features using probabilities of matched groups of features; (g) classifying each 2D segment as a face segment or a non-face segment based on said accumulated probability of at least one group of features in each of said 2D segments; and (h) tracking said human faces by finding an intersection of 3D segments associated with said face segments with at least subsequent view planes.
2. A method according to claim 1, wherein said features are regions in said 2D segment which are darker than the rest of said 2D segment.
3. A method according to claim 1, wherein said features are regions in said 2D segment having edges.
4. A method according to claim 1, wherein said group of features forms a triangle.

5. A method according to claim 1, wherein said method comprises the further steps of: determining, for each said 2D segment, a first measure of said 2D segment having a colour of human skin; and eliminating 2D segments having said first measure below a first predetermined threshold from further processing.
6. A method according to claim 1, wherein said method comprises the further step of: eliminating 2D segments having a form that is non-elliptical from further processing.
7. A method according to claim 1, wherein said method comprises the further steps of: determining movements of said 2D segments from positions of previous 2D segments associated with the same 3D segments; and eliminating 2D segments from further processing where said movement is below a second predetermined threshold.
8. An apparatus for detecting and tracking human faces across a sequence of video frames, said apparatus comprising: means for forming a 3D pixel data block from said sequence of video frames; means for segmenting said 3D data block into a set of 3D segments using 3D spatiotemporal segmentation; means for forming 2D segments from an intersection of said 3D segments with a view plane, each 2D segment being associated with one 3D segment; in at least one of said 2D segments, means for extracting features and grouping said features into one or more groups of features; for each group of features, means for computing a probability that said group of features represents human facial features based on the similarity of the geometry of said group of features with the geometry of a human face model; means for matching at least one group of features with a group of features in a previous 2D segment and computing an accumulated probability that said group of features represents human facial features using probabilities of matched groups of features; means for classifying each 2D segment as a face segment or a non-face segment based on said accumulated probability of at least one group of features in each of said 2D segments; and means for tracking said human faces by finding an intersection of 3D segments associated with said face segments with at least subsequent view planes.
9. An apparatus according to claim 8, wherein said features are regions in said 2D segment which are darker than the rest of said 2D segment.
10. An apparatus according to claim 8, wherein said features are regions in said 2D segment having edges.

11. An apparatus according to claim 8, wherein said group of features forms a triangle.
12. An apparatus according to claim 8, wherein said apparatus further comprises: means for determining, for each said 2D segment, a first measure of said 2D segment having a colour of human skin; and means for eliminating 2D segments having said first measure below a first predetermined threshold from further processing.
13. An apparatus according to claim 8, wherein said apparatus further comprises: means for eliminating 2D segments having a form that is non-elliptical from further processing.
14. An apparatus according to claim 8, wherein said apparatus further comprises: means for determining movements of said 2D segments from positions of previous 2D segments associated with the same 3D segments; and means for eliminating 2D segments from further processing where said movement is below a second predetermined threshold.
15. A computer-executable program stored on a computer readable storage medium, the program for detecting and tracking human faces across a sequence of video frames, said program comprising: code for forming a 3D pixel data block from said sequence of video frames; code for segmenting said 3D data block into a set of 3D segments using 3D spatiotemporal segmentation; code for forming 2D segments from an intersection of said 3D segments with a view plane, each 2D segment being associated with one 3D segment; in at least one of said 2D segments, code for extracting features and grouping said features into one or more groups of features; for each group of features, code for computing a probability that said group of features represents human facial features based on the similarity of the geometry of said group of features with the geometry of a human face model; code for matching at least one group of features with a group of features in a previous 2D segment and computing an accumulated probability that said group of features represents human facial features using probabilities of matched groups of features; code for classifying each 2D segment as a face segment or a non-face segment based on said accumulated probability of at least one group of features in each of said 2D segments; and code for tracking said human faces by finding an intersection of 3D segments associated with said face segments with at least subsequent view planes.

16. A program according to claim 15, wherein said features are regions in said 2D segment which are darker than the rest of said 2D segment.

17. A program according to claim 15, wherein said features are regions in said 2D segment having edges.
18. A program according to claim 15, wherein said group of features forms a triangle.
19. A program according to claim 15, wherein said program further comprises: code for determining, for each said 2D segment, a first measure of said 2D segment having a colour of human skin; and code for eliminating 2D segments having said first measure below a first predetermined threshold from further processing.
20. A program according to claim 15, wherein said program further comprises: code for eliminating 2D segments having a form that is non-elliptical from further processing.
21. A program according to claim 15, wherein said program further comprises: code for determining movements of said 2D segments from positions of previous 2D segments associated with the same 3D segment; and code for eliminating 2D segments from further processing where said movement is below a second predetermined threshold.